DiTASiC (Differential Taxa Abundance including Similarity Correction) is designed as a comprehensive approach for abundance estimation and differential abundance assessment of individual taxa in metagenomics samples. It provides resolution on the strain level, which is characterized by the presence of highly similar genome sequences. The challenge arises that many reads match to multiple genomes equally, causing a significant bias of abundance counts.

A generalized linear model is introduced for the resolution of the shared read counts and allows inference of accurate taxa abundances. In addition, abundance estimation uncertainties are captured, which play a crucial role in differential abundance analysis. A novel statistical framework is built, which integrates the abundance variance and infers abundance distributions for differential testing sensitive to strain level. No prior assumptions on overall abundance change are required. A resulting list of tested taxa is reported with estimated abundances, fold-changes and p-values to infer significance.

 

title for image

Workflow of DiTASiC: it consists of the three main parts of (1) mapping, (2) taxa abundance estimation, and (3) differential abundance assessment. (1) We rely on prior pre-filtering of species by fast external profiling tools for identification. Reads are mapped to the given reference sequences and the number of matching reads per reference are counted (*mapping abundance*). A similarity matrix reflecting the genome similarities is constructed. (2) A generalized linear model (GLM) is built for resolution of read count ambiguities, resulting in corrected abundance estimates along with standard errors. (3) For the comparison of metagenomes, abundances are formulated as distributions and their divergence reflects differential events. A final list of tested taxa is reported.

Download

Dependencies

DiTASiC requires Python 3 and R installation (version ≥ 3.3.1).

Internally, the read simulator Mason and the pseudoaligner kallisto are used and need to be available in the PATH or DiTASiC’s runtime directory. Binaries for both programs for x86-64 systems are provided in the .tar-package for easy installation.

Usage

The following input files are required to run the pipeline:

Optional:

kallisto index -i index_name Reference-FASTA-files

DiTASiC comprises four main commands:

  1. Creating the similarity matrix of the reference sequences.

    The key element is to imitate read and mapping characteristics as good as possible to reflect the source of ambiguities in the matrix. At minimum, a parameter determining the read length is obligatory, corresponding to the average read length found in the samples considered. Further parameters defining mismatch probabilities are optional, but are recommended (see Manual). Additionally, a file REFS_PATH holding the paths to the fasta-files of the references is required.

    ditasic_matrix -l 100 REFS_PATH

    Default output: similarity_matrix.npy, a square N x N - matrix for the N references listed in the given path-file REFS_PATH. ditasic_matrix also generates a kallisto index of all references, if no index is given via the -i flag.

  2. Mapping the reads to the reference genome sequences.

    In the mapping step, all reads of a sample are preliminarily assigned to the given reference sequences. For the mapping, internally, a pseudo-aligner as part of the kallisto framework (command kallisto pseudo, (see ‘Formats’ in Manual)) is used. Subsequently, the number of read hits per reference genome (both unique and shared hits) are counted and stored in a vector of length N according to the number of references considered.

    ditasic_mapping -i INDEX -l 100 REFS_PATH SAMPLE

    Default output: a numpy vector SAMPLE_mapped_counts.npy and a file SAMPLE_total.npy with the total number of reads mapped. This mapping step needs to be run individually for each sample.

  3. Taxa abundance estimation in one sample.

    Apply this shorter command for estimation of taxa abundances in one sample. If you are considering more samples and their comparison go to step (4).

    The mapping counts obtained in previous step (2) are strongly biased in the presence of similar genome sequences. An abundance resolution down to strain/sub-strain level is provided by a GLM resolving the shared read counts. The files generated by the previous commands (1-2) serve as input: similarity matrix, mapping counts and total number of mapped reads. Optional arguments concerning a filtering step to identify absent taxa can be found in the Manual.

    ditasic -r REFS_PATH -a similarity_matrix.npy -x SAMPLE1_mapped_counts.npy -n SAMPLE1_total.npy -o OUTPUT_filename [options]

    Default output: a tab-separated file, listing the taxa with abundance count estimates along with corresponding error estimates, reporting the range of accuracy of the abundance estimates, and absent taxa flagged.

  4. Taxa abundance estimation and differential assessment between two samples.

    Here, abundances of taxa are estimated for two metagenomic samples and are additionally compared. The GLM model is applied for abundance resolution. Subsequently, a statistical framework identifies which taxa significantly change their abundance from one metagenome to another and which hold a constant abundance.
    The same input data is used as in (3), but is extended by mapping counts and total number of reads of a second sample.

    ditasic -r REFS_PATH -a similarity_matrix.npy -x SAMPLE1_mapped_counts.npy -n SAMPLE1_total.npy -y SAMPLE2_mapped_counts.npy -m SAMPLE2_total.npy -o OUTPUT_filename [options]

    Default output: a tab-separated file, listing all taxa considered with estimated abundances for both samples, log2 fold-changes for differential evaluation and adjusted p-values to infer significance of differential abundance.

Manual

Further details about all available command-line options can be found in the Manual.

Example workflow

You can download a directory with sample data and a script executing a typical DiTASiC workflow here. It contains 35 bacterial reference sequences and two samples of reads that were generated from the references. The example script estimates the abundance of the first sample and the differential abundance between the first and the second sample, while taking care of all preliminary steps such as similarity matrix creation or mapping of the samples. The script example.sh requires the ditasic, ditasic_matrix.py and ditasic_mapping.py scripts to be available in your PATH.

The sample script demonstrates the three essential steps of DiTASiC and allows inspecting all temporary files that are created in a successful run of the program. The final program output can be compared to the expected output and should be identical.